1. Decision Trees

(a) Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of the figure above. The numbers inside the boxes indicate the mean of Y within each region.

Answer

Decision Tree.

Decision Tree.

(b) Create a diagram similar to the left-hand panel of the figure, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.

Answer

Decision Tree.

Decision Tree.

2. Regression Trees

In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

(a) Split the data set into a training set and a test set.

Answer

library(ISLR)
attach(Carseats)
set.seed(1)

train <- sample(dim(Carseats)[1],dim(Carseats)[1]/2) #train sample
Carseats.train <- Carseats[train,] #Creating Carseats.train with training data
Carseats.test <- Carseats[-train,] #Creating Carseats.test with test data

(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test error rate do you obtain?

Answer

library(tree)
tree.carseats <- tree(Sales ~ ., data = Carseats,subset = train) # Fitting Regression Tree
summary(tree.carseats) #Summary the results
## 
## Regression tree:
## tree(formula = Sales ~ ., data = Carseats, subset = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "Advertising" "Income"     
## [6] "CompPrice"  
## Number of terminal nodes:  18 
## Residual mean deviance:  2.36 = 429.5 / 182 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.2570 -1.0360  0.1024  0.0000  0.9301  3.9130
plot(tree.carseats)  #Plot the tree
text(tree.carseats, pretty = 0)

pred.carseats <- predict(tree.carseats,Carseats.test) #predict
mean((Carseats.test$Sales - pred.carseats)^2) #MSE
## [1] 4.148897
sqrt(mean((Carseats.test$Sales - pred.carseats)^2))
## [1] 2.036884

We have produced a regression tree with Sales as the response and the other variables as predictors. The tree actually uses six variables: "ShelveLoc", "Price", "Age", "Advertising", "Income" and "CompPrice". The most important variable for predicting Sales is ShelveLoc, followed by Price.

Our tree has eighteen (18) terminal nodes and a residual mean deviance of 2.36. A small residual mean deviance indicates a good fit to the training data. The test MSE of our tree is about 4.1.

To interpret the results more concretely, we can compute the RMSE, which equals 2.04. This means that, on average, predictions from our model miss the true value by about 2 units of Sales.

(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test error rate?

Answer

set.seed(1)
#Using cv.tree() function to determine the optimal level of tree complexity
cv.carseats <- cv.tree(tree.carseats)
cv.carseats
## $size
##  [1] 18 17 16 15 14 12 11 10  9  8  7  6  5  4  3  1
## 
## $dev
##  [1] 1127.510 1163.270 1163.270 1169.097 1169.097 1153.636 1129.237
##  [8] 1139.639 1139.232 1109.839 1135.897 1126.203 1218.821 1205.007
## [15] 1331.927 1612.664
## 
## $k
##  [1]      -Inf  15.48181  15.53599  18.69038  18.74886  21.05038  23.79480
##  [8]  25.78579  26.01210  30.10435  32.74801  53.28569  72.33061  78.19599
## [15] 141.73781 251.22901
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

In this case, the tree of size 8 is selected by cross-validation, since its deviance, 1109.839, is the lowest.

We can also plot the results to examine the relationship between size and deviance. The plot confirms that a tree of size 8 gives us the lowest cross-validation error (deviance).

set.seed(1)
plot(cv.carseats$size,cv.carseats$dev,type='b')

# Pruning a Regression Tree , best size = 8 
prune.carseats <- prune.tree(tree.carseats,best=8)
summary(prune.carseats) # summary the results with 8 terminal nodes
## 
## Regression tree:
## snip.tree(tree = tree.carseats, nodes = c(39L, 11L, 8L, 18L, 
## 7L, 10L))
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price"     "Age"      
## Number of terminal nodes:  8 
## Residual mean deviance:  3.363 = 645.7 / 192 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -5.28400 -1.12000  0.02695  0.00000  0.99010  4.59800
plot(prune.carseats) # plot the pruned tree
text(prune.carseats, pretty = 0)

yhat.prune = predict(prune.carseats,newdata = Carseats.test) 
mean((yhat.prune - Carseats.test$Sales)^2)
## [1] 5.09085

Using the pruned tree to make predictions on the test set, we estimate a higher test MSE (5.1) than for the unpruned tree, which suggests the pruned tree may be underfitting here.

It is worth mentioning that pruning a tree increases the bias, and when the accompanying reduction in variance is not large enough, the test MSE can go up, as it does in our case. On the other hand, the smaller tree gives us greater interpretability.
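To see this trade-off directly, a quick sketch (assuming `tree.carseats` and `Carseats.test` from the previous parts are still in the workspace; `sizes` and `test.mse` are names introduced here) computes the test MSE for each candidate pruned size:

```r
# Sketch: test MSE for every candidate pruned tree size.
# prune.tree() returns the closest attainable subtree if an exact size is unavailable.
sizes <- 2:18
test.mse <- sapply(sizes, function(s) {
  pruned <- prune.tree(tree.carseats, best = s)
  mean((predict(pruned, newdata = Carseats.test) - Carseats.test$Sales)^2)
})
plot(sizes, test.mse, type = "b", xlab = "Tree size", ylab = "Test MSE")
```

Such a plot makes it easy to see where pruning starts to hurt test performance.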

(d) Use the bagging approach in order to analyze this data. What test error rate do you obtain? Use the importance() function to determine which variables are most important.

Answer

set.seed(1)
library(randomForest) #insert the library
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
bag.carseats <- randomForest(Sales ~ ., data = Carseats.train, mtry = 10,
                             ntree = 500, importance = TRUE)
# mtry is the number of predictors sampled for splitting at each node;
# mtry = 10 considers all 10 predictors at every split, which corresponds to bagging
yhat.bag <- predict(bag.carseats, newdata = Carseats.test)
mean((yhat.bag - Carseats.test$Sales)^2)
## [1] 2.614642

We can see how the bagging approach decreases the test MSE to 2.6. We grow B regression trees, one per bootstrapped training set, and then average the resulting predictions of those B trees.

It is worth mentioning that those B trees are grown deep and not pruned, so each has low bias, and averaging the trees also reduces the variance. The result is a lower test MSE, since the expected test error decomposes into squared bias plus variance (plus irreducible error).
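As an illustrative sketch (not the randomForest internals), bagging can be hand-rolled with the tree package; `B`, `preds` and `yhat.manual` are names introduced here, and the resulting MSE will differ somewhat from randomForest's:

```r
# Hand-rolled bagging sketch: grow B trees on bootstrap samples of the
# training data and average their test-set predictions.
set.seed(1)
B <- 100
preds <- sapply(1:B, function(b) {
  boot <- sample(nrow(Carseats.train), replace = TRUE)
  fit  <- tree(Sales ~ ., data = Carseats.train[boot, ])
  predict(fit, newdata = Carseats.test)
})
yhat.manual <- rowMeans(preds)  # average over the B trees
mean((yhat.manual - Carseats.test$Sales)^2)
```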

In addition, we can now use the importance() function to determine which variables are most important.

importance(bag.carseats)
##                %IncMSE IncNodePurity
## CompPrice   16.4714051    126.605047
## Income       4.0561872     78.821925
## Advertising 16.2730251    122.793232
## Population   0.7711188     62.796112
## Price       54.5571815    512.940454
## ShelveLoc   42.4486118    320.749734
## Age         20.5369414    184.804253
## Education    2.7755968     42.427788
## Urban       -2.3962157      8.583232
## US           7.2258536     17.605661

We can conclude that "Price" and "ShelveLoc" are the most important variables for predicting Sales.

(e) Use random forests to analyze this data. What test error rate do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

Answer

set.seed(1)
rf.carseats <- randomForest(Sales ~ ., data = Carseats.train, mtry = 3, ntree = 500, importance = TRUE)
yhat.rf <- predict(rf.carseats , newdata = Carseats.test)
mean((yhat.rf - Carseats.test$Sales)^2)
## [1] 3.237463

Random forests follow the same idea as bagging, except that each time a split is considered in a tree built on a bootstrapped training sample, only m predictors are sampled from the full set of p predictors, typically m = \(\sqrt{p}\) or m = p/3 (bagging corresponds to m = p). Averaging many highly correlated trees does not give a large reduction in variance. To avoid this, random forests randomly exclude even strong predictors from some splits, which "de-correlates" the bagged trees and leads to a greater reduction in variance.

In this case, the random forest with m = \(p/3\) (usually chosen for regression trees) has a higher test MSE (3.2) than the bagging approach, where all p predictors were available for every split. This suggests that the strong predictors are needed at most splits to predict well here. Still, both ensemble methods clearly reduce the MSE compared with our earlier single regression trees (pruned and unpruned).
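To examine the effect of m directly, a sketch (assuming randomForest is loaded and the train/test split from (a) exists; `mse.by.m` is a name introduced here) evaluates every value of mtry from 1 to p = 10:

```r
# Sketch: test MSE as a function of m, the number of predictors
# considered at each split (m = 10 reproduces bagging).
set.seed(1)
mse.by.m <- sapply(1:10, function(m) {
  fit <- randomForest(Sales ~ ., data = Carseats.train,
                      mtry = m, ntree = 500)
  mean((predict(fit, newdata = Carseats.test) - Carseats.test$Sales)^2)
})
plot(1:10, mse.by.m, type = "b", xlab = "mtry (m)", ylab = "Test MSE")
```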

importance(rf.carseats)
##                %IncMSE IncNodePurity
## CompPrice    8.0753055     122.68542
## Income       4.7774822     128.79339
## Advertising 12.6417009     139.29703
## Population   0.6016753      94.79730
## Price       37.6994519     382.76169
## ShelveLoc   32.8805652     239.26395
## Age         17.6866850     208.74153
## Education    0.4932141      66.98484
## Urban        0.5287570      16.29274
## US           7.8534356      31.39177

We may notice that "Price" and "ShelveLoc" are again the two most important variables for predicting Sales. It is worth mentioning that, as expected, the random forest de-correlates the trees, so the %IncMSE values of Price and ShelveLoc drop to roughly 38 and 33, compared with roughly 55 and 42 under bagging.

3. Classification Trees

This problem involves the OJ data set which is part of the ISLR package.

(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

Answer

library(ISLR)
set.seed(1)
summary(OJ) #summary our data
##  Purchase WeekofPurchase     StoreID        PriceCH         PriceMM     
##  CH:653   Min.   :227.0   Min.   :1.00   Min.   :1.690   Min.   :1.690  
##  MM:417   1st Qu.:240.0   1st Qu.:2.00   1st Qu.:1.790   1st Qu.:1.990  
##           Median :257.0   Median :3.00   Median :1.860   Median :2.090  
##           Mean   :254.4   Mean   :3.96   Mean   :1.867   Mean   :2.085  
##           3rd Qu.:268.0   3rd Qu.:7.00   3rd Qu.:1.990   3rd Qu.:2.180  
##           Max.   :278.0   Max.   :7.00   Max.   :2.090   Max.   :2.290  
##      DiscCH            DiscMM         SpecialCH        SpecialMM     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.05186   Mean   :0.1234   Mean   :0.1477   Mean   :0.1617  
##  3rd Qu.:0.00000   3rd Qu.:0.2300   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :0.50000   Max.   :0.8000   Max.   :1.0000   Max.   :1.0000  
##     LoyalCH          SalePriceMM     SalePriceCH      PriceDiff      
##  Min.   :0.000011   Min.   :1.190   Min.   :1.390   Min.   :-0.6700  
##  1st Qu.:0.325257   1st Qu.:1.690   1st Qu.:1.750   1st Qu.: 0.0000  
##  Median :0.600000   Median :2.090   Median :1.860   Median : 0.2300  
##  Mean   :0.565782   Mean   :1.962   Mean   :1.816   Mean   : 0.1465  
##  3rd Qu.:0.850873   3rd Qu.:2.130   3rd Qu.:1.890   3rd Qu.: 0.3200  
##  Max.   :0.999947   Max.   :2.290   Max.   :2.090   Max.   : 0.6400  
##  Store7      PctDiscMM        PctDiscCH       ListPriceDiff  
##  No :714   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  Yes:356   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.140  
##            Median :0.0000   Median :0.00000   Median :0.240  
##            Mean   :0.0593   Mean   :0.02731   Mean   :0.218  
##            3rd Qu.:0.1127   3rd Qu.:0.00000   3rd Qu.:0.300  
##            Max.   :0.4020   Max.   :0.25269   Max.   :0.440  
##      STORE      
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :2.000  
##  Mean   :1.631  
##  3rd Qu.:3.000  
##  Max.   :4.000
train <- sample(dim(OJ)[1],800) #train sample with 800 observations
OJ.train <- OJ[train,]  #create OJ.train
OJ.test <- OJ[-train,] #create OJ.test(remaining observations)

(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

Answer

library(tree)

oj.tree <- tree(Purchase ~ ., data = OJ.train,method="class") # Fit a tree to the training data with Purchase as the response and the other variables as predictors
summary(oj.tree) #summary our tree
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train, method = "class")
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
## Number of terminal nodes:  8 
## Residual mean deviance:  0.7305 = 578.6 / 792 
## Misclassification error rate: 0.165 = 132 / 800

We have produced a classification tree with Purchase as the response and the other variables as predictors. The tree actually uses four variables: "LoyalCH", "PriceDiff", "SpecialCH" and "ListPriceDiff". Our tree has eight (8) terminal nodes, and the training error rate (misclassification error rate) is 0.165.

(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

Answer

oj.tree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1064.00 CH ( 0.61750 0.38250 )  
##    2) LoyalCH < 0.508643 350  409.30 MM ( 0.27143 0.72857 )  
##      4) LoyalCH < 0.264232 166  122.10 MM ( 0.12048 0.87952 )  
##        8) LoyalCH < 0.0356415 57   10.07 MM ( 0.01754 0.98246 ) *
##        9) LoyalCH > 0.0356415 109  100.90 MM ( 0.17431 0.82569 ) *
##      5) LoyalCH > 0.264232 184  248.80 MM ( 0.40761 0.59239 )  
##       10) PriceDiff < 0.195 83   91.66 MM ( 0.24096 0.75904 )  
##         20) SpecialCH < 0.5 70   60.89 MM ( 0.15714 0.84286 ) *
##         21) SpecialCH > 0.5 13   16.05 CH ( 0.69231 0.30769 ) *
##       11) PriceDiff > 0.195 101  139.20 CH ( 0.54455 0.45545 ) *
##    3) LoyalCH > 0.508643 450  318.10 CH ( 0.88667 0.11333 )  
##      6) LoyalCH < 0.764572 172  188.90 CH ( 0.76163 0.23837 )  
##       12) ListPriceDiff < 0.235 70   95.61 CH ( 0.57143 0.42857 ) *
##       13) ListPriceDiff > 0.235 102   69.76 CH ( 0.89216 0.10784 ) *
##      7) LoyalCH > 0.764572 278   86.14 CH ( 0.96403 0.03597 ) *

We will pick node 20) to interpret the information displayed.

First, we see the name of the splitting variable (SpecialCH) and the split criterion, here SpecialCH < 0.5. Then come the number of observations in that branch (70) and the deviance (60.89). Next, MM is the predicted class of Purchase, chosen from the two possible classes CH and MM. We also get the fraction of observations in that branch belonging to each class, with CH (16%) listed first and MM (84%) second. Lastly, the asterisk (*) indicates that this branch leads to a terminal node.

(d) Create a plot of the tree, and interpret the results.

Answer

plot(oj.tree)
text(oj.tree, pretty = 0)

We can conclude that "LoyalCH" is the most important variable in the tree. If LoyalCH < 0.26 we predict MM, while if LoyalCH > 0.76 we predict CH. In the middle region, 0.26 < LoyalCH < 0.76, the prediction also depends on PriceDiff, ListPriceDiff and SpecialCH. It is worth mentioning that MM is predicted in both child nodes of one split on the left side of the tree; that split is performed because it increases node purity, even though it does not reduce the classification error.

(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

Answer

oj.pred.test <- predict(oj.tree, OJ.test, type = "class")
table(oj.pred.test,OJ.test$Purchase)
##             
## oj.pred.test  CH  MM
##           CH 147  49
##           MM  12  62
missclass <- sum(OJ.test$Purchase != oj.pred.test)
missclass/length(oj.pred.test) # (49+12)/(147+62+49+12) test error rate 
## [1] 0.2259259

Test Error Rate is approximately 22.6 %.

(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.

Answer

set.seed(1)
cv.oj <- cv.tree(oj.tree,FUN=prune.misclass)
cv.oj
## $size
## [1] 8 5 2 1
## 
## $dev
## [1] 152 152 161 306
## 
## $k
## [1]       -Inf   0.000000   4.666667 160.000000
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

We can see that the lowest cross-validation error, 152, occurs for both size = 5 and size = 8. Since the two sizes tie, we prefer the smaller tree: pruning to size = 5 gives us not only better interpretability but also a more efficient model.

(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

Answer

The plot of size against deviance shows, as expected, that trees with 5 and 8 terminal nodes have the same cross-validated error.

plot(cv.oj$size, cv.oj$dev, type = "b" , xlab ="Tree size", ylab = " Deviance")

(h) Which tree size corresponds to the lowest cross-validated classification error rate?

Answer

We can observe that tree sizes of 5 and 8 nodes share the lowest cross-validation error.

(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

Answer

oj.prune <- prune.misclass(oj.tree, best = 5)
plot(oj.prune)
text(oj.prune,pretty=0)

We can see that our pruned tree of size 5 uses three variables, LoyalCH, PriceDiff and SpecialCH, compared with four in the unpruned tree. LoyalCH remains the most important variable, with PriceDiff second.

(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?

summary(oj.prune)
## 
## Classification tree:
## snip.tree(tree = oj.tree, nodes = 3:4)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff" "SpecialCH"
## Number of terminal nodes:  5 
## Residual mean deviance:  0.8256 = 656.4 / 795 
## Misclassification error rate: 0.165 = 132 / 800

The training error rate of the pruned tree is 0.165, the same as the training error rate of the unpruned tree. As already mentioned, the smaller tree gives us greater interpretability and a more efficient model.

(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?

Answer

yhat.pruned <- predict(oj.prune, OJ.test, type = "class")
table(yhat.pruned,OJ.test$Purchase)
##            
## yhat.pruned  CH  MM
##          CH 147  49
##          MM  12  62
missclass.pruned <- sum(OJ.test$Purchase != yhat.pruned)
missclass.pruned/length(yhat.pruned)
## [1] 0.2259259

In this case, the pruned tree has exactly the same test error rate, 22.6%, as the unpruned tree. Of course, this could change with different random seeds for the train/test split and for the cross-validation stage.

4. SVM

In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.

(a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.

Answer

library(ISLR)
Auto_b <- Auto
gas.median <- median(Auto$mpg)
bin.var <- ifelse(Auto$mpg > gas.median, 1 , 0) #create the binary variable
Auto_b$mpglevel = as.factor(bin.var)
Auto_b$mpg <-NULL #remove mpg as we have already created mpglevel
Auto_b$name <- NULL #remove name

(b) Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.

Answer

library(e1071)
set.seed(1)
tune.out <-tune(svm,mpglevel~.,data=Auto_b,kernel="linear",ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100))) 
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##   100
## 
## - best performance: 0.08679487 
## 
## - Detailed performance results:
##    cost      error dispersion
## 1 1e-03 0.13544872 0.05715486
## 2 1e-02 0.09179487 0.05812971
## 3 1e-01 0.09948718 0.06331482
## 4 1e+00 0.09192308 0.05827672
## 5 5e+00 0.08935897 0.05443327
## 6 1e+01 0.08935897 0.05443327
## 7 1e+02 0.08679487 0.05567394

Comparing linear-kernel SVMs over a range of values of the cost parameter, we find that cost = 100 gives the best performance, a cross-validation error of 0.087. On the other hand, the highest cross-validation error, 0.135, occurs at cost = 0.001. To find the optimal parameter, the tune() function used 10-fold cross-validation (its default) to evaluate each cost value.

(c) Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost. Comment on your results.

Answer

set.seed(1)
tune.out <- tune(svm, mpglevel ~ ., data = Auto_b, kernel = "polynomial" , ranges = list(cost = c(0.001,0.01,0.1,1,5,10,100), degree = c(2,3,4)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost degree
##     5      3
## 
## - best performance: 0.07910256 
## 
## - Detailed performance results:
##     cost degree      error dispersion
## 1  1e-03      2 0.56115385 0.04344202
## 2  1e-02      2 0.48455128 0.08604319
## 3  1e-01      2 0.28333333 0.08903669
## 4  1e+00      2 0.25775641 0.10016352
## 5  5e+00      2 0.18846154 0.07426805
## 6  1e+01      2 0.18846154 0.06917535
## 7  1e+02      2 0.16570513 0.07729330
## 8  1e-03      3 0.46711538 0.05079366
## 9  1e-02      3 0.26814103 0.09451019
## 10 1e-01      3 0.20429487 0.10736879
## 11 1e+00      3 0.08929487 0.06309033
## 12 5e+00      3 0.07910256 0.05600672
## 13 1e+01      3 0.07910256 0.05600672
## 14 1e+02      3 0.08416667 0.05540253
## 15 1e-03      4 0.52544872 0.06493340
## 16 1e-02      4 0.37788462 0.06752086
## 17 1e-01      4 0.27076923 0.09243581
## 18 1e+00      4 0.21935897 0.08632997
## 19 5e+00      4 0.19653846 0.10103932
## 20 1e+01      4 0.17339744 0.09552369
## 21 1e+02      4 0.15038462 0.07729445

For polynomial SVMs the best performance (lowest cross-validation error) is 0.079, at cost = 5 and degree = 3. It is worth mentioning that tune(), again using 10-fold cross-validation by default, found exactly the same error at cost = 10 and degree = 3. Last but not least, the highest cross-validation error, 0.56, occurs at cost = 0.001 and degree = 2.

set.seed(1)
tune.out <- tune(svm, mpglevel ~ ., data = Auto_b, kernel = "radial" , ranges = list(cost = c(0.1,1,10,100,1000), gamma = c(0.5,1,2,3,4)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     1     1
## 
## - best performance: 0.07391026 
## 
## - Detailed performance results:
##     cost gamma      error dispersion
## 1  1e-01   0.5 0.08923077 0.05559893
## 2  1e+00   0.5 0.07391026 0.05315495
## 3  1e+01   0.5 0.07897436 0.04567335
## 4  1e+02   0.5 0.10448718 0.06862934
## 5  1e+03   0.5 0.10448718 0.06072287
## 6  1e-01   1.0 0.08923077 0.06408390
## 7  1e+00   1.0 0.07391026 0.05176240
## 8  1e+01   1.0 0.08916667 0.05799733
## 9  1e+02   1.0 0.10698718 0.07386765
## 10 1e+03   1.0 0.10698718 0.07386765
## 11 1e-01   2.0 0.13512821 0.05778941
## 12 1e+00   2.0 0.07634615 0.05879482
## 13 1e+01   2.0 0.09423077 0.07220895
## 14 1e+02   2.0 0.09923077 0.07799865
## 15 1e+03   2.0 0.09923077 0.07799865
## 16 1e-01   3.0 0.30128205 0.11716574
## 17 1e+00   3.0 0.07634615 0.05867200
## 18 1e+01   3.0 0.09673077 0.06968463
## 19 1e+02   3.0 0.09923077 0.07316609
## 20 1e+03   3.0 0.09923077 0.07316609
## 21 1e-01   4.0 0.53820513 0.07278281
## 22 1e+00   4.0 0.08660256 0.05385501
## 23 1e+01   4.0 0.10179487 0.06661035
## 24 1e+02   4.0 0.10179487 0.06966782
## 25 1e+03   4.0 0.10179487 0.06966782

Finally, for the radial kernel the lowest cross-validation error is 0.074, at cost = 1 and gamma = 1; this is the best performance among all three kernels (linear, polynomial, radial).

(d) Make some plots to back up your assertions in (b) and (c).

Answer

svm.linear1 <- svm(mpglevel ~ weight + acceleration, data = Auto_b, kernel = "linear", cost = 100) # fit a linear SVM for mpglevel with predictors weight and acceleration
plot(svm.linear1, Auto_b, weight ~ acceleration)

svm.poly <- svm(mpglevel ~ displacement + cylinders, data = Auto_b, kernel = "polynomial", cost = 5, degree = 3)
plot(svm.poly, Auto_b, displacement ~ cylinders)

svm.radial2 <- svm(mpglevel ~ origin + horsepower, data = Auto_b, kernel = "radial", cost = 1, gamma = 1)
plot(svm.radial2, Auto_b, horsepower ~ origin) # SVM with radial kernel, plot horsepower ~ origin

svm.linear2 <- svm(mpglevel ~ weight + cylinders, data = Auto_b, kernel = "linear", cost = 100)
plot(svm.linear2, Auto_b, weight ~ cylinders) # SVM with linear kernel, plot weight ~ cylinders

svm.radial3 <- svm(mpglevel ~ acceleration + horsepower, data = Auto_b ,kernel = "radial", cost = 1 , gamma = 1)
plot(svm.radial3,Auto_b,horsepower~acceleration)

We can easily observe from the plots how, in general, the radial kernel separates the data in a more flexible way. This is what we expected, since the radial kernel achieved the lowest cross-validation error.

It is also easy to see in the plot below how the model can overfit the data if we sharply increase the gamma parameter. In that case we expect a higher test error rate, and acceleration and horsepower would no longer predict well whether a car gets high or low gas mileage (the pink and blue regions).

svm.radial4 <- svm(mpglevel ~ acceleration + horsepower, data = Auto_b ,kernel = "radial", cost = 1 , gamma = 100)
plot(svm.radial4,Auto_b,horsepower~acceleration)

5. SVM

Here we explore the maximal margin classifier on a toy data set. (a) We are given n = 7 observations in p = 2 dimensions. For each observation, there is an associated class label. Sketch the observations.

Answer

x1 <- c(3,2,4,1,2,4,4)
x2 <- c(4,2,4,4,1,3,1)
colours = c("red","red","red","red","blue","blue","blue")
plot(x1,x2,col = colours , xlab = "X1", ylab = "X2" )

(b) Sketch the optimal separating hyperplane, and provide the equation for this hyperplane of the following form.

\(\beta_0 + \beta_1X_1 + \beta_2X_2 = 0\)

Answer

In our case the classes are linearly separable, and the optimal separating hyperplane passes between observations (support vectors) 2, 3 and 5, 6.

The first pair of support vectors is (2,2) and (2,1) and the second pair is (4,4) and (4,3). Taking the midpoint of each pair gives

=> (2,1.5) and (4,3.5), the two points through which we fit our hyperplane

\(y = mx + b\)

Gradient(slope) \(m = (y_2 - y_1)/(x_2 - x_1)\)

In our case \(m = (3.5-1.5)/(4-2) = 1\)

so for m = 1 => y = x + b

We take one point of our line e.g. (2,1.5) we have 1.5 = 2 + b => b = -0.5

so for b = -0.5 and m = 1 our equation is \(X_2 = X_1 - 0.5\), and our hyperplane is \(X_1 - X_2 - 0.5 = 0\)

plot(x1,x2,col = colours,xlim = c(0,5),ylim = c(0,5))
abline(-0.5,1)

(c) Describe the classification rule for the maximal margin classifier. It should be something along the lines of “Classify to Red if \(\beta_0 + \beta_1X_1 + \beta_2X_2 > 0\), and classify to Blue otherwise.” Provide the values for \(\beta_0, \beta_1\), and \(\beta_2\).

Answer

In general, infinitely many hyperplanes can separate the data. Given a separating hyperplane, we can compute the perpendicular distance from each observation to it; the minimum such distance is called the margin. The maximal margin hyperplane is therefore the separating hyperplane with the largest margin, i.e. the farthest minimum distance to the observations. Building the maximal margin hyperplane lets us classify a test observation based on which side of the hyperplane it lies. In our case the classification rule for the maximal margin classifier is: "Classify to Red if \(0.5 - X_1 + X_2 > 0\), and classify to Blue otherwise", so \(\beta_0 = 0.5\), \(\beta_1 = -1\) and \(\beta_2 = 1\).

For example, to classify a test observation \((X_1, X_2)\) we simply apply this rule and see on which side of the hyperplane the observation lies.
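As a small sketch, the rule can be coded directly from the hyperplane \(X_1 - X_2 - 0.5 = 0\), classifying Red on the side where \(X_2 > X_1 - 0.5\) (the function name `classify` is introduced here):

```r
# Maximal margin classification rule: Red if 0.5 - X1 + X2 > 0, else Blue
classify <- function(X1, X2) ifelse(0.5 - X1 + X2 > 0, "Red", "Blue")
classify(x1, x2) # "Red" "Red" "Red" "Red" "Blue" "Blue" "Blue"
classify(3, 1)   # a test observation below the line -> "Blue"
```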

(d) On your sketch, indicate the margin for the maximal margin hyperplane.

Answer

plot(x1,x2,col=colours,xlim = c(0,5) , ylim = c(0,5))
abline(-0.5,1)
abline(-1,1,lty = 2) # blue support vectors (they are on the margin and support the maximal margin line)
abline(0,1,lty = 2) #red support vectors (they are on the margin and support the maximal margin line)
lines(x = c(2,1.9), y = c(1,1.39))
text(x = 2.2 , y = 1.4 , labels = "margin", cex = 0.8)
text(x = 2.4 , y = 2.1 , labels = "margin", cex = 0.8)
lines(x = c(2,2.1),y = c(2,1.60))

The margin is the perpendicular distance between the support vectors and the separating line (hyperplane). The general formula for the distance from a point \((x_0,y_0)\) to the line \(\beta_0 + \beta_1X_1 + \beta_2X_2 = 0\) is:

\(d = \frac{\mid{x_0\beta_1 + y_0\beta_2 + \beta_0}\mid}{\sqrt{\beta_1^2 + \beta_2^2}}\)

so for \((2,2)\) and \(X_1 - X_2 -0.5=0\) , \(margin = \frac{\sqrt2}{4}\)
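Substituting the support vector \((2,2)\) with \(\beta_0 = -0.5\), \(\beta_1 = 1\), \(\beta_2 = -1\) makes the calculation explicit:

\(d = \frac{\mid (2)(1) + (2)(-1) - 0.5 \mid}{\sqrt{1^2 + (-1)^2}} = \frac{0.5}{\sqrt{2}} = \frac{\sqrt{2}}{4}\)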

(e) Indicate the support vectors for the maximal margin classifier.

Answer

The support vectors are the points (2,1), (2,2), (4,3) and (4,4). These four observations lie exactly on the margin and "support" the maximal margin hyperplane: if any of them were moved, the maximal margin line would move as well.

plot(x1,x2,col=colours,xlim = c(0,5) , ylim = c(0,5))
abline(-0.5,1)
abline(-1,1,lty = 2)  
abline(0,1,lty = 2)
points(2,2, pch=13,col="black",cex=2)
points(2,1, pch=13,col="black",cex=2)
points(4,4, pch=13,col="black",cex=2)
points(4,3, pch=13,col="black",cex=2)

(f) Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane.

Answer

A slight movement of the seventh observation (4,1) would not affect the maximal margin hyperplane. As noted above, the hyperplane depends only on the support vectors, which in our case are (2,1), (2,2), (4,3) and (4,4). Moving any other observation leaves the hyperplane unchanged as long as the point stays outside the boundary of the margin.

(g) Sketch a hyperplane that is not the optimal separating hyperplane, and provide the equation for this hyperplane.

Answer

We can choose a line that separates the two classes without being the optimal separating hyperplane.

plot(x1,x2,col=colours,xlim = c(0,5) , ylim = c(0,5))
abline(-0.55,1.1)

Then our new hyperplane will have an equation such as :

\(-0.55 + 1.1X_1 - X_2 = 0\) (not the optimal hyperplane)

(h) Draw an additional observation on the plot so that the two classes are no longer separable by a hyperplane.

Answer

Let us add, for example, an 8th observation at (4,2) with class "red". Then our plot becomes:

plot(x1,x2,col=colours,xlim = c(0,5) , ylim = c(0,5))
points(4,2,col="red")
text(x=4.5,y=2,labels = "8th observation",cex=0.7) 

We can easily observe that the two classes are no longer linearly separable, so we would have to use other approaches (e.g. polynomial or radial kernels) to separate them successfully.
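As a sketch, an SVM with a radial kernel (from e1071, already used in question 4) can draw a non-linear boundary for the enlarged data set; the data frame `toy` and the tuning values cost = 10 and gamma = 1 are illustrative choices, not tuned:

```r
library(e1071)
# toy data: the seven original points plus the new red point (4,2)
toy <- data.frame(x1 = c(x1, 4), x2 = c(x2, 2),
                  y  = factor(c(colours, "red")))
svm.toy <- svm(y ~ ., data = toy, kernel = "radial", cost = 10, gamma = 1)
plot(svm.toy, toy) # plot the fitted non-linear decision boundary
```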

6. Hierarchical clustering

Consider the USArrests data. We will now perform hierarchical clustering on the states.

(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.

Answer

library(ISLR)
set.seed(1)
summary(USArrests)
##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00
hc.complete <- hclust(dist(USArrests),method="complete")
plot(hc.complete, main = "Complete Linkage",xlab = "",ylab="",cex=0.9)

The hclust() function implements hierarchical clustering; here we applied the “complete” linkage method to the USArrests data set.

(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?

Answer

First, we use the cutree() function with 3 as an argument in order to cut our tree into three distinct clusters:

cutree(hc.complete,3)
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              1              1              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              1              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              3              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              1              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              3              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              2              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2

From the output above we can see exactly which cluster each state belongs to. For example, Michigan belongs to the first cluster, Pennsylvania to the third and New Jersey to the second. With the table() function we can also see that sixteen states belong to the first cluster, fourteen to the second, and twenty to the third.

table(cutree(hc.complete,3))
## 
##  1  2  3 
## 16 14 20

(c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.

Answer

xsc <- scale(USArrests) #Scale the variables before performing hierarchical clustering
hc.scomplete <- hclust(dist(xsc), method = "complete")
plot(hc.scomplete,main = "US Arrests (scaled)")

As the variables are scaled to have standard deviation one, each variable contributes equally to the hierarchical clustering, and the same importance is given to each of them.

(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

Answer

cutree(hc.scomplete,3)
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              2              3              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              2              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              3              1              3 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              2              3              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              1              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              3              3              3
table(cutree(hc.scomplete,3))
## 
##  1  2  3 
##  8 11 31

Scaling the variables affects not only the height of the dendrogram but also the clusters: after scaling, the first cluster has eight states instead of sixteen, the second has eleven instead of fourteen, and the third has thirty-one instead of twenty.

Our data set contains statistics, in arrests per 100,000 residents, for Assault, Murder and Rape in each of the 50 US states, whereas the population living in urban areas is measured as a percentage. In my opinion, we should scale the data, mostly because our variables are measured on different scales; otherwise the choice of units (per 100,000 versus percent) would greatly affect the dissimilarity measure obtained.
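
One quick way to see why scaling matters here (a small check using base R) is to compare the raw variances of the four variables, which differ by orders of magnitude; without scaling, Assault would dominate the Euclidean distances:

apply(USArrests, 2, var) #Variance of each variable on its original scale
##     Murder    Assault   UrbanPop       Rape 
##   18.97047 6945.16571  209.51878   87.72916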

7. PCA and K-Means Clustering

In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.

  1. Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.

Hint: There are a number of functions in R that you can use to generate data. One example is the rnorm() function; runif() is another option. Be sure to add a mean shift to the observations in each class so that there are three distinct classes.

Answer

set.seed(1)
data <- matrix(nrow=60, ncol=50) 

# For each column we fill three blocks of rows (1-20, 21-40 and 41-60) with draws centred at different means (6, 0 and -6), so that each block forms a distinct class. (The sampled `mean` value below is drawn but not actually used in the shifts.)

for(i in 1:ncol(data)) {
  mean <- sample(1:100, size = 1)
  data[1:20,i] <- rnorm(20,mean=6)
  data[21:40,i]<- rnorm(20, mean=0)
  data[41:60,i]<- rnorm(20, mean=-6)
}
plot(data)

(b) Perform PCA on the 60 observations and plot the first two principal components’ eigenvectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component eigenvectors.

Answer

First, we create data_pca, which applies PCA to the 60 observations, and then we plot the first two principal components with different colours in order to indicate our three different classes.

data_pca <- prcomp(data, scale = TRUE)
plot(data_pca$x[,1],data_pca$x[,2],col=c(rep("green",20),rep("red",20),rep("purple",20)),pch = 20, xlab = "First Principal Component",ylab = "Second Principal Component ")

(c) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?

Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.

Answer

We performed the K-means algorithm with K = 3 and nstart = 20. The algorithm starts by randomly assigning observations to clusters and computing the cluster centroids; it then re-assigns each observation to its closest centroid and iterates until the assignments stop changing. The nstart argument repeats this whole procedure from 20 random initialisations and keeps the best solution (the one with the lowest total within-cluster sum of squares).

From the table, we can see that K-means has successfully divided the observations into three subgroups of 20 observations each, matching the classes in our simulated data set.

set.seed(1)
km.out <- kmeans(data, 3, nstart = 20)
km.out$cluster
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
table(km.out$cluster, c(rep(1,20), rep(2,20), rep(3,20)))
##    
##      1  2  3
##   1 20  0  0
##   2  0 20  0
##   3  0  0 20

(d) Perform K-means clustering with K = 2. Describe your results.

set.seed(1)
km.out <- kmeans(data, 2 , nstart = 20)

km.out$cluster
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
table(km.out$cluster, c(rep(1,20), rep(2,20), rep(3,20)))
##    
##      1  2  3
##   1 20 20  0
##   2  0  0 20

In this case, the K-means algorithm with K = 2 forced our simulated data into 2 clusters, so the middle class (observations 21-40) has been absorbed. To be more specific, cluster “1” has absorbed all the points of the true second class, and from the table we can see that it contains 40 observations, while the third class (observations 41-60) forms cluster “2” on its own.

This is clearer if we observe the plot below (cluster “1” contains observations 1-40, cluster “2” contains observations 41-60).

plot(km.out$cluster)

(e) Now perform K-means clustering with K = 4, and describe your results.

set.seed(1)
km.out <- kmeans(data, 4 , nstart = 20)

km.out$cluster
##  [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 3 3 1 3 3 1 3 3 1 1 1 3 1 1 3 1 3 1 1 3
table(km.out$cluster, c(rep(1,20), rep(2,20), rep(3,20)))
##    
##      1  2  3
##   1  0  0 10
##   2  0 20  0
##   3  0  0 10
##   4 20  0  0

In this case, the K-means algorithm with K = 4 has separated our simulated data into 4 clusters (with randomly assigned labels “1”, “2”, “3” and “4”). We can observe that the first two true classes have each kept their 20 observations (clusters “4” and “2” in our case). The third class, on the other hand, has been split into two clusters, “3” and “1”, of 10 observations each.

From the graph below, we can also see how the third class (observations 41-60) has been divided into the two new clusters “3” and “1”.

plot(km.out$cluster)

(f) Now perform K-means clustering with K = 3 on the first two principal components, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component’s corresponding eigenvector, and the second column is the second principal component’s corresponding eigenvector. Comment on the results.

Answer

summary(data_pca)
## Importance of components:
##                           PC1     PC2     PC3    PC4     PC5     PC6
## Standard deviation     6.9270 0.38928 0.37229 0.3465 0.33571 0.32736
## Proportion of Variance 0.9597 0.00303 0.00277 0.0024 0.00225 0.00214
## Cumulative Proportion  0.9597 0.96269 0.96546 0.9679 0.97012 0.97226
##                            PC7     PC8     PC9    PC10   PC11    PC12
## Standard deviation     0.31564 0.30591 0.29471 0.27916 0.2735 0.27046
## Proportion of Variance 0.00199 0.00187 0.00174 0.00156 0.0015 0.00146
## Cumulative Proportion  0.97425 0.97613 0.97786 0.97942 0.9809 0.98238
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.26307 0.26148 0.23994 0.23776 0.23237 0.22623
## Proportion of Variance 0.00138 0.00137 0.00115 0.00113 0.00108 0.00102
## Cumulative Proportion  0.98376 0.98513 0.98628 0.98741 0.98849 0.98952
##                           PC19    PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.22256 0.21446 0.20148 0.19804 0.19377 0.18047
## Proportion of Variance 0.00099 0.00092 0.00081 0.00078 0.00075 0.00065
## Cumulative Proportion  0.99051 0.99143 0.99224 0.99302 0.99378 0.99443
##                           PC25    PC26    PC27    PC28   PC29    PC30
## Standard deviation     0.17743 0.16316 0.16040 0.14963 0.1414 0.13455
## Proportion of Variance 0.00063 0.00053 0.00051 0.00045 0.0004 0.00036
## Cumulative Proportion  0.99506 0.99559 0.99610 0.99655 0.9970 0.99731
##                           PC31   PC32    PC33    PC34    PC35   PC36
## Standard deviation     0.12598 0.1230 0.11774 0.11283 0.10614 0.0999
## Proportion of Variance 0.00032 0.0003 0.00028 0.00025 0.00023 0.0002
## Cumulative Proportion  0.99763 0.9979 0.99821 0.99846 0.99869 0.9989
##                           PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.09753 0.09384 0.08487 0.08134 0.07672 0.06547
## Proportion of Variance 0.00019 0.00018 0.00014 0.00013 0.00012 0.00009
## Cumulative Proportion  0.99908 0.99926 0.99940 0.99953 0.99965 0.99974
##                           PC43    PC44    PC45    PC46    PC47    PC48
## Standard deviation     0.05734 0.05104 0.04689 0.04547 0.03597 0.02874
## Proportion of Variance 0.00007 0.00005 0.00004 0.00004 0.00003 0.00002
## Cumulative Proportion  0.99980 0.99985 0.99990 0.99994 0.99996 0.99998
##                           PC49    PC50
## Standard deviation     0.02356 0.01966
## Proportion of Variance 0.00001 0.00001
## Cumulative Proportion  0.99999 1.00000

To begin with, we can observe that the first principal component explains 95.97% of the variance in the data, while the second explains about 0.3%. The cumulative proportion for PC1 and PC2 is 96.3%, meaning that the first two principal components explain around 96.3% of the variance in our simulated data. We can therefore expect that performing K-means with K = 3 on these two components will give the same separation into three clusters of 20 observations each.

We can also use a plot to see the proportion of variance explained by each principal component in our simulated data set.

plot(data_pca,type="l")
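
The proportion of variance explained can also be computed directly from the standard deviations stored in data_pca (a short sketch):

pve <- data_pca$sdev^2/sum(data_pca$sdev^2) #PVE of each component
pve[1:2]       #About 0.960 and 0.003, matching the summary above
cumsum(pve)[2] #About 0.963 cumulative for the first two components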

set.seed(1)

km.out <- kmeans(data_pca$x[,1:2], centers = 3, nstart=20)

summary(km.out$cluster)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       2       2       3       3
km.out$cluster
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

As we expected, performing the K-means algorithm with K = 3 on the first two principal components separates our simulated data into 3 clusters, “1”, “2” and “3”, each containing 20 observations, just as when we applied K-means to the whole data set (60 observations on 50 variables). We can also see this clearly in the table below.

table(km.out$cluster, c(rep(1,20), rep(2,20), rep(3,20)))
##    
##      1  2  3
##   1 20  0  0
##   2  0 20  0
##   3  0  0 20

(g) Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.

set.seed(1)

data.scaled <- scale(data, scale = TRUE)

kmscaled.out <- kmeans(data.scaled,centers=3)

kmscaled.out$cluster
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

First we scale our data so that each variable has standard deviation one, which is the default behaviour of the scale() function. After that, we run K-means clustering with K = 3 on the new scaled data. The results are the same as those we obtained in question (b): three clusters, “1”, “2” and “3”, each containing 20 observations. We can see this clearly in the table below.

table(kmscaled.out$cluster, c(rep(1,20), rep(2,20), rep(3,20)))
##    
##      1  2  3
##   1 20  0  0
##   2  0 20  0
##   3  0  0 20

Of course, this result is what we expected, since we generated our variables in the same type of units and with the same order of magnitude, so scaling does not affect the result.

However, with real-life data scaling is recommended, as we often have variables measured in different types of units and with different orders of magnitude. Without scaling, some variables would appear more important than they actually are, which would have a negative effect on our analysis.